Modelling COVID-19 infection across England
In this tutorial we’ll cover some work on fitting a purely spatial Bayesian model to predict the COVID-19 infection rate across England.
Getting setup
This tutorial makes use of the CARBayes package which is not a dependency of 4DModeller, to follow through the tutorial you will need to install it.
install.packages("CARBayes")Study aim and data description
This study describes how to fit a purely spatial Bayesian hierarchical model (BHM) based on Markov Chain Monte Carlo (MCMC) simulation method to estimate the spatial pattern of COVID-19 infection rate in England. The first thing is to load all the packages used in the COVID-19 case study.
The study region is mainland England, which is partitioned into 6789
neighbourhoods at the Middle Layer Super Output Area (MSOA) scale. The
infections data are the total reported number of COVID-19 cases in each
MSOA from Jan 8, 2022 to March 26, 2022. The shapefile of the study
region is a SpatialPolygonsDataFrame, which is used to map
the data. It stores the location, shape and attributes of geographic
features for the neighbourhoods.
We first need to retrieve the infections data from the fdmr example data store and unpack it and we’ll use retrieve_tutorial_data to do this.
fdmr::retrieve_tutorial_data(dataset = "covid_mcmc")##
## Tutorial data extracted to /home/runner/fdmr/tutorial_data/covid_mcmc
The COVID-19 data and the related covariate information are included
in our tutorial data package. We’ll load in the data using the
load_tutorial_data function.
s_covid <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "s_covid.rds")Next we’ll use the load_tutorial_data function to load
in the spatial data we want.
sp_data <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "spatial_data.rds")In this study, we use the areal unit modelling approach to fit the BHM and then make model inference using MCMC method. To do this, we need to construct a non-negative symmetric \(n \times n\) neighbourhood or adjacency matrix \(\boldsymbol{W}\) that accounts for the spatio-temporal autocorrelation structure, where \(n=6789\) is the number of areal units. The neighbourhood matrix specifies the spatial closeness between pairs of areal units. The elements \(\{w_{ij}\}\) in \(\boldsymbol{W}\) can be either continuous or binary, and a larger value of \(w_{ij}\) represents that MSOAs \((i,j)\) are spatially closer to each other. Here we use the border sharing specification, so that \(w_{ij}=1\) if MSOAs \((i,j)\) share a common geographical border, and \(w_{ij}=0\) otherwise.
## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0
w <- spdep::nb2mat(W_nb, style = "B")Model specification
We use a Bayesian hierarchical model to predict the spatial COVID-19 infection rate at the MSOA level in England. Let \(Y_{i}\) denote the total number of reported COVID cases for neighbourhood \(i=1,\ldots, n(=6789)\) during the study period, and \(N_{i}\) denote the (official) estimated population living in MSOA \(i\). \(Y_{i}\) is assumed to have a Poisson distribution with parameters (\(N_{i}\), \(\theta_{i}\)), where \(\theta_{i}\) is the true unobserved COVID-19 infection rate in MSAO \(i\). We follow a standard path in modelling \(\theta_{i}\) with a log link to the Poisson and start with a model where the linear predictor decomposes additively into a set of covariates and a Gaussian Markov Random Field process, which characterises the infection of the disease after the covariate effects have been accounted for. A general Bayesian hierarchical model commonly specified is given by
\[\begin{align} \nonumber Y_{i}\vert N_{i}, \theta_{i} &\sim \text{Poisson}(N_{i}\theta_{i}),\ \ i=1,\ldots,n,\\ log(\theta_{i} )&=\boldsymbol{x_{i}^{\top}}\boldsymbol{\beta}+\phi_{i}. \end{align}\]
The spatial random effects \(\{\phi_i\}\) are included in the model to account for any residual spatio-temporal autocorrelation after adjusting for covariates \(\boldsymbol{x_{i}}\). Here we utilise the spatial modelling structure “BYM”, proposed by Besag, York, and Mollié (1991), to model \(\{\phi_i\}\). It is given by
\[\begin{align} \nonumber \phi_i &=\phi_i^{(1)}+\phi_i^{(2)}\\ \phi_i^{(1)}\vert\boldsymbol\phi_{-i}^{(1)}&\sim \text{N}\left( \frac{\sum_{j=1}^{n}w_{ij}\phi_j^{(1)}}{\sum_{j=1}^{n}w_{ij}}, \frac{\tau_1^2}{\sum_{j=1}^{n}w_{ij}}\right)\\ \nonumber \phi_i^{(2)}&\sim \text{N}(0, \tau_2^2), \end{align}\]
where \(\phi_i\) now consists of two components. \(\phi_i^{(1)}\) is assigned the intrinsic CAR prior (Besag et al., 1991), and \(\phi_i^{(2)}\) is a set of independent and identically normally distributed random effects, with mean zero and common variance \(\tau_2^2\).
Define the model formula
In order to fit the spatial model, a model formula needs to be defined, by including the response in the left-hand side and the fixed and random effects in the right-hand side. First, we consider the scenario of including no covariates.
Fit the model
Finally, we fit the spatial model using the function
S.CARbym() of the package CARBayes developed
by Lee (2013).
MCMC_model <- CARBayes::S.CARbym(
formula = form,
data = s_covid,
family = "poisson",
W = w,
burnin = 10000,
n.sample = 30000,
thin = 10,
verbose = FALSE
)## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Warning in mat2listw(W): style is M (missing); style should be set to a valid
## value
For comparison purpose, we fit a separate BHM to the same dataset using the INLA approach. Likewise, it uses the BYM to model the spatial random effects.
s_covid$ID <- seq(1, nrow(s_covid))
formula <- total.cases ~ 1 + INLA::f(ID,
model = "bym",
graph = w
)
INLA_model <- INLA::inla(formula,
data = s_covid,
family = "poisson",
E = s_covid$population,
control.compute = list(
dic = TRUE,
waic = TRUE,
cpo = TRUE
),
verbose = FALSE
)## Error in model.frame.default(new.fix.formula, data.same.len, na.action = inla.na.action) :
## invalid type (list) for variable 'INLA::f(ID, model = "bym", graph = w)'
##
## *** inla.core.safe: inla.program has crashed: rerun to get better initial values. try=1/2
## Error in model.frame.default(new.fix.formula, data.same.len, na.action = inla.na.action) :
## invalid type (list) for variable 'INLA::f(ID, model = "bym", graph = w)'
##
## *** inla.core.safe: inla.program has crashed: rerun to get better initial values. try=2/2
## Error in model.frame.default(new.fix.formula, data.same.len, na.action = inla.na.action) :
## invalid type (list) for variable 'INLA::f(ID, model = "bym", graph = w)'
## Error in inla.core.safe(formula = formula, family = family, contrasts = contrasts, : *** Fail to get good enough initial values. Maybe it is due to something else.
Finally, we fit a BHM to the same dataset using the INLA-SPDE approach.
initial_range <- diff(range(sp_data@data[, "LONG"])) / 5
max_edge <- initial_range / 8
mesh <- INLA::inla.mesh.2d(
loc = sp_data@data[, c("LONG", "LAT")],
max.edge = c(1, 2) * max_edge,
offset = c(initial_range / 4, initial_range),
cutoff = max_edge / 7
)
prior_range <- initial_range
spde <- INLA::inla.spde2.pcmatern(
mesh = mesh,
prior.range = c(prior_range, 0.5),
prior.sigma = c(1, 0.01)
)
s_covid_cp <- s_covid
sp::coordinates(s_covid_cp) <- c("LONG", "LAT")
cmp <- total.cases ~ 0 + Intercept + f(main = coordinates, model = spde)
inlabru.model <- inlabru::bru(cmp,
data = s_covid_cp,
family = "poisson",
E = s_covid_cp$population,
control.family = list(link = "log"),
options = list(
verbose = FALSE
)
)## Warning in add_mapper(component$main, label = component$label, lhoods = lh, : All covariate evaluations for 'Intercept' are NULL; an intercept component was likely intended.
## Implicit latent intercept component specification is deprecated since version 2.1.14.
## Use explicit notation '+ Intercept(1)' instead (or '+1' for '+ Intercept(1)').
## Warning in handle_problems(e_input): The input evaluation 'Intercept' for
## 'Intercept' failed. Perhaps the data object doesn't contain the needed
## variables? Falling back to '1'.
## Warning in handle_problems(e_input): The input evaluation 'coordinates' for 'f'
## failed. Perhaps the data object doesn't contain the needed variables? Falling
## back to '1'.
## Warning in fm_onto_mesh(mesh, loc, crs = crs): Unclear if the 'loc' class
## ('numeric') is of a type we know how to handle.
## Error in matrix(is.na(as.vector(loc)), nrow = nrow(loc), ncol = ncol(loc)): non-numeric matrix extent
Model comparison
In terms of model selection criteria, we show the different values for each model based on DIC and WAIC. The model fitted using the INLA-SPDE approach performs better than the other two models in terms of the lowest DIC and WAIC values.
modfit <- data.frame(
DIC = c(MCMC_model$modelfit[1], INLA_model$dic$dic, inlabru.model$dic$dic),
WAIC = c(MCMC_model$modelfit[3], INLA_model$waic$waic, inlabru.model$waic$waic)
)## Error in eval(expr, envir, enclos): object 'INLA_model' not found
## Error: object 'modfit' not found
modfit## Error in eval(expr, envir, enclos): object 'modfit' not found
We compare the posterior COVID-19 infection rate estimates between the models. In general, all models provide similar posterior COVID-19 infection rate estimates.
mcmc_fitted_prev <- exp(mean(MCMC_model$samples$beta) + apply(MCMC_model$samples$psi, 2, mean))
mcmc_lc <- exp(quantile(MCMC_model$samples$beta, 0.025) +
apply(MCMC_model$samples$psi, 2, quantile, 0.025))
mcmc_uc <- exp(quantile(MCMC_model$samples$beta, 0.975) +
apply(MCMC_model$samples$psi, 2, quantile, 0.975))
comb <- data.frame(
"INLA_BYM" = INLA_model$summary.fitted.values$mean,
"INLA_lc" = INLA_model$summary.fitted.values$`0.025quant`,
"INLA_uc" = INLA_model$summary.fitted.values$`0.975quant`,
"INLA_SPDE" = inlabru.model$summary.fitted.values$mean[1:nrow(s_covid)],
"INLA_SPDE_lc" = inlabru.model$summary.fitted.values$`0.025quant`[1:nrow(s_covid)],
"INLA_SPDE_uc" = inlabru.model$summary.fitted.values$`0.975quant`[1:nrow(s_covid)],
"MCMC" = mcmc_fitted_prev,
"MCMC_lc" = mcmc_lc,
"MCMC_uc" = mcmc_uc
)## Error in eval(expr, envir, enclos): object 'INLA_model' not found
pairs(comb[, c("MCMC", "INLA_SPDE", "INLA_BYM")],
pch = 19, cex = 0.3, col = "orange", lower.panel = panel.smooth
)## Error in eval(expr, envir, enclos): object 'comb' not found
## Error in eval(expr, envir, enclos): object 'comb' not found
The spatial patterns of the infection rate estimates for each model are displayed below.
sp_data@data$est.rate.mcmc <- mcmc_fitted_prev
domain <- sp_data@data$est.rate.mcmc
fdmr::plot_map(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
)Map of the predicted infection rates for using MCMC.
sp_data@data$est.rate.inlabym <- INLA_model$summary.fitted.values$mean## Error in eval(expr, envir, enclos): object 'INLA_model' not found
domain <- sp_data@data$est.rate.inlabym
fdmr::plot_map(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
)Map of the predicted infection rates for using INLA-BYM.
sp_data@data$est.rate.inlaspde <- inlabru.model$summary.fitted.values$mean[1:nrow(s_covid)]## Error in eval(expr, envir, enclos): object 'inlabru.model' not found
domain <- sp_data@data$est.rate.inlaspde
fdmr::plot_map(
polygon_data = sp_data,
domain = domain,
palette = "Reds",
legend_title = "Rate",
add_scale_bar = TRUE,
polygon_fill_opacity = 0.8,
)Map of the predicted infection rates for using INLA-SPDE.
Now we consider the scenario of fitting the models with covariates of interest. We select several risk factors used in the COVID-19 tutorial https://4dmodeller.github.io/fdmr/articles/covid.html. The models described above are fitted and compared again after incorporating covariate information. First we fit the BHM using MCMC method.
form <- total.cases ~ 1 + offset(log(population)) +
IMD + perc.wb + perc.ba + age1 + pm25
MCMC_model2 <- CARBayes::S.CARbym(
formula = form,
data = s_covid,
family = "poisson",
W = w,
burnin = 10000,
n.sample = 30000,
thin = 10,
verbose = F
)## Warning in mat2listw(W): style is M (missing); style should be set to a valid
## value
Next we fit the BHM using the INLA-BYM approach.
formula <- total.cases ~ 1 + f(ID,
model = "bym",
graph = w
) + IMD + perc.wb + perc.ba + age1 + pm25
INLA_model2 <- INLA::inla(formula,
data = s_covid,
family = "poisson",
E = s_covid$population,
control.compute = list(
dic = TRUE,
waic = TRUE,
cpo = TRUE
),
verbose = FALSE
)Next we fit the BHM using the INLA-SPDE approach.
cmp <- total.cases ~ 0 + Intercept + f(main = coordinates, model = spde) +
IMD + perc.wb + perc.ba + age1 + pm25
inlabru_model2 <- bru(cmp,
data = s_covid_cp,
family = "poisson",
E = s_covid_cp$population,
control.family = list(link = "log"),
options = list(
verbose = FALSE
)
)## Error in bru(cmp, data = s_covid_cp, family = "poisson", E = s_covid_cp$population, : could not find function "bru"
The regression coefficients of the selected covariates for all models are compared. In general, the models have similar regression coefficients estimates.
regr_est <- cbind.data.frame(
"MCMC" = MCMC_model2$summary.results[1:(nrow(MCMC_model2$summary.results) - 2), 1],
"INLA_BYM" = INLA_model2$summary.fixed[, 1],
"INLA_SPDE" = inlabru_model2$summary.fixed[, 1]
)## Error in eval(expr, envir, enclos): object 'inlabru_model2' not found
regr_est## Error in eval(expr, envir, enclos): object 'regr_est' not found
